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Abstract 

Studies of microbial evolutionary dynamics are being transformed by the availability of affordable high-throughput 
sequencing technologies, which allow whole-genome sequencing of hundreds of related taxa in a single study. 
Reconstructing a phylogenetic tree of these taxa is generally a crucial step in any evolutionary analysis. Instead of 
constructing genome assemblies for all taxa, annotating these assemblies, and aligning orthologous genes, many 
recent studies 1) directly map raw sequencing reads to a single reference sequence, 2) extract single nucleotide poly- 
morphisms (SNPs), and 3) infer the phylogenetic tree using maximum likelihood methods from the aligned SNP positions. 
However, here we show that, when using such methods to reconstruct phylogenies from sets of simulated sequences, both 
the exclusion of nonpolymorphic positions and the alignment to a single reference genome, introduce systematic biases 
and errors in phylogeny reconstruction. To address these problems, we developed a new method that combines 
alignments from mappings to multiple reference sequences and show that this successfully removes biases from the 
reconstructed phylogenies. We implemented this method as a web server named REALPHY (Reference sequence 
Alignment-based Phylogeny builder), which fully automates phylogenetic reconstruction from raw sequencing reads. 
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Introduction 

One of the unifying goals across fields as diverse as evolutionary 
biology, epidemiology, and ecology is to understand the evo- 
lutionary relationships between different taxa (Preston et al. 
1998; Gill et al. 2005; Ishii et al. 2006; Chun et al. 2009; Ogura 
et al. 2009; Harris et al. 2010), which are typically quantified by 
constructing phylogenetic trees (Nei and Kumar 2000). 
Recently, our ability to resolve such trees has greatly improved 
due to the rate at which sequence data can be generated via 
high-throughput sequencing methods. However, using high- 
throughput sequencing data to precisely determine phyloge- 
netic relationships between taxa is not trivial. 

Traditionally, phylogenies are reconstructed from whole- 
genome sequence data by 1) assembling sequence reads into 
contigs; 2) annotating open reading frames; 3) identifying 
orthologous open reading frames across all genomes; 4) align- 
ing orthologous coding regions; and 5) reconstructing a phy- 
logenetic tree from these multiple alignments (Touchon et al. 
2009; Luo et al. 201 1; Rodriguez-R et al. 2012). Subsequently, a 
phylogenetic tree is then reconstructed from the alignments, 
typically by applying maximum likelihood methods such as 
RAxML (Stamatakis 2006) or PhyML (Guindon et al. 2010), or 
Bayesian methods such as PhyloBayes (Lartillot et al. 2009) or 
AArBayes (Huelsenbeck and Ronquist 2001). 

Although it is generally accepted that this method 
allows accurate reconstruction of phylogenetic trees 



(Rosenberg and Kumar 2003), the series of steps involved is 
not only time consuming but requires a sophisticated com- 
bination of bioinformatic methods. 

Recently, an alternative method that is simpler and less 
time consuming has been applied in several large-scale mi- 
crobial studies (Harris et al. 2010; Epstein et al. 2012; Harris 
et al. 2012; McCann et al. 2013; Cui et al. 2013). In this method, 
raw short-sequence reads from each taxon are directly 
mapped to the genome sequence of a single reference. 
Homologous sites from all taxa (and in some studies only 
those sites containing single nucleotide polymorphisms 
[SNPs]) are then concatenated into a multiple sequence 
alignment from which the phylogenetic tree is reconstructed. 

There are reasons to suspect that such reference-mapping- 
based phylogeny reconstruction methods might introduce 
systematic errors. First, multiple alignments are traditionally 
constructed progressively, that is, starting by aligning the 
most closely related pairs and iteratively aligning these 
subalignments (Notredame et al. 2000; Chenna et al. 2003). 
Aligning all sequences instead to a single reference is likely to 
introduce biases. For example, reads with more SNPs are less 
likely to successfully and unambiguously align to the reference 
sequence, as is common in alignments of more distantly re- 
lated taxa. This mapping asymmetry between strains that are 
closely and distantly related to the reference sequence may 
affect the inferred phylogeny, and this has indeed been 
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observed (Spencer et al. 2007). Second, as maximum likeli- 
hood methods explicitly estimate branch lengths, including 
only alignment columns that contain SNPs and excluding 
(typically many) columns that are nonpolymorphic, may 
also affect the topology of the inferred phylogeny. This 
effect has been described before for morphological traits 
(Lewis 2001) and is one reason long-branch attraction can 
be alleviated with maximum likelihood methods when 
nonpolymorphic sites are included in the alignment 
(Felsenstein 1981). Furthermore, the more general issue of 
selectively leaving out data from multiple sequence align- 
ments has been studied recently and found to affect tree 
topology (Shavit Grievink et al. 2013). 

By simulating sequence evolution across small phylogenies 
of known topology, we identify parameter regimes where the 
combination of single-taxon reference mapping and SNP ex- 
traction generally leads to severe errors in phylogeny recon- 
struction. These simulations also show that even when 
including nonpolymorphic sites in an alignment, the effect 
of mapping to a single reference can lead to systematic errors. 
In particular, we find that when some taxa are diverged by 
more than 5-10% from the reference, the distance to the 
reference is systematically underestimated. This can generate 
incorrect tree topologies, especially when other branches in 
the tree are short. Moreover, using data from a set of 21 
Escherichia coli genomes, a set of 19 Pseudomonas syringae 
genomes, and a set of 32 Sinorhizobium meliloti genomes, we 
show that biases due to mapping to a single reference and 
exclusion of nonpolymorphic sites significantly affect the in- 
ferred phylogenetic trees for realistic data sets. 

To alleviate these problems, we present a method that 
combines alignments obtained by mapping reads to not 
one but to multiple reference sequences. Applying this 
method to both the simulated and real data sets suggests 
that, by combining sequence mappings to multiple refer- 
ences, mapping biases can be avoided and accurate phylog- 
enies can be reconstructed when each taxon is close (i.e., <5% 
divergence) to at least one of the reference sequences. To 
make this phylogeny reconstruction procedure available to 
researchers, including experimental biologists without specific 
expertise in bioinformatics, we have implemented this 
method as a web server called Reference sequence 
Alignment-based Phylogeny (REALPHY) builder (available at 
http://realphy.unibas.ch, last accessed March 13, 2014). 
REALPHY takes as input raw short-sequence read data sets 
and reconstructs phylogenies by aligning the reads to one or 
more reference sequences. 

Results and Discussion 

The inference of phylogenetic trees from collections of poly- 
morphic sites identified by mapping short-sequence reads 
from multiple genomes to a single reference genome is an 
increasingly common practice (Harris et al. 2010; Epstein et al. 
2012; Holt et al. 2012; McAdam et al. 2012; Okoro et al. 2012; 
Cui et al. 2013). However, as indicated in the Introduction, 
there are several reasons to suspect that this method may 
introduce systematic errors. 



To test in what situations this method may result in in- 
correct tree reconstruction, we simulated sequence evolution 
along known phylogenies, systematically varying both topol- 
ogy (i.e., the placement of the reference genome) and branch 
lengths. For each data set, we then compared the true tree 
topology with the tree topologies inferred from 1) the correct 
and complete alignment of the evolved sequences; 2) the 
alignment obtained after mapping short reads and retaining 
only SNP positions; and 3) the alignment after mapping short 
reads and retaining both SNPs and nonpolymorphic sites. 

Sequence Simulation 

Tree Shapes and Branch Lengths 

To allow a systematic exploration of parameter space, we 
restricted our analysis to unrooted four-taxon trees, which 
have only five branches and only three possible topologies: 
(A,B),(C,D); (A,C),(B,D); and (A,D),(B,C) (fig. 1A). Throughout, 
we use taxon A (fig. 1A) as the reference sequence to which 
short-sequence reads from all other taxa are mapped. To 
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Fig. 1. Tree shapes and branch lengths used to simulate sequence 
evolution. (A) The three possible topologies in a four-taxon tree. 
(B) The sample space of tree topologies. Each axis indicates the diver- 
gence along one set of branches: the divergence of the red branches is 
indicated along the x-axis and the divergence of the blue branches is 
indicated along the y-axis. We sampled at five points along each axis, 
that is, at 0.5, 1, 2, 4, and 8% divergence, for a total of 25 different 
combinations of branch lengths. (C) All possible tree shapes are consid- 
ered in the analyses. There are 1 1 total tree shapes in a four-taxon tree 
that divide the branches into two types (shown here as red and blue). 
In all our analyses, the reference node is the lower left node of the tree. 
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understand the effects of differences in branch length on tree 
reconstruction, we considered all ways of partitioning the five 
branches into two subsets (the red and blue branches in fig. 
1C). Because our four-taxon tree is asymmetric (the sequence 
from one taxon is designated as the reference sequence), 
there are 11 possible groupings of the branches, which we 
call tree shapes. For each tree shape, we varied the branch 
lengths in these two groups over a range of values (0.5%, 1%, 
2%, 4%, and 8% divergence). This gave rise to 25 possible 
branch length combinations, shown as grid points in figure 
1 B. Varying tree shape and branch lengths in this manner gave 
rise to a total of 275 (25 x 11) different trees. 

To refer to any of these trees individually, we specify each 
of the parameters varied above: first, the tree shape (1-11), 
followed by the divergence level of the majority set of 
branches (i.e., the blue branches in fig. 1C), and finally the 
divergence level of the minority set of branches (i.e., the red 
branches in fig. 1C). We represent the results for different 
branch length combinations in a matrix, for example, trees 
with a divergence of 0.5% over blue branches and 8% over red 
branches correspond to the bottom-right corner of figure IB. 

Recombination 

Recombination (gene conversion) occurs frequently in bacte- 
rial species (Didelot and Maiden 2010). Thus, in addition to 
varying tree shape and branch lengths, we investigated the 
effect of short-read mapping on phylogeny reconstruction in 
the presence of recombination. To simulate this process, 10% 
of the nucleotides in the reference sequence were replaced 
with the orthologous nucleotides from the sequence of its 
cousin taxon (taxon D in fig. 1; using taxon C would yield 
identical results). Thus, with the inclusion of recombination, 
we simulated sequence evolution over a total of 550 trees 
(2 X 275). 

The Impact of SNP Extraction and Read Mapping Bias 
on Tree Topology 

Accurate Phylogenetic Reconstruction When Using the True 
Alignment 

We first tested whether the correct tree topology could be 
reliably recovered from the true alignment (the evolution of 
100,000 nucleotides simulated along a four-taxon tree). We 
found that when there was no recombination, all tree topol- 
ogies were reconstructed correctly by PhyAAL (Guindon et al. 
2010), a maximum likelihood tree inference program. Not 
surprisingly, when a sufficient amount of recombination 
was incorporated, phylogeny reconstruction was no longer 
error-free (supplementary fig. SI, Supplementary Material 
online). 

Phylogenies Reconstructed Using Only SNP Positions 
Are Unreliable 

We then tested for parameter regimes that led to incorrect 
phylogenies when mapping to a single reference, extracting 
SNP positions only, and reconstructing a phylogeny using 
maximum likelihood. We identified 131 different parameter 
settings for which incorrect topologies were inferred for a 
fraction of data sets, even in the absence of recombination 



(fig. 2). Up to 100% of all inferred tree topologies were incor- 
rect for some parameter sets (e.g., for tree shape 1 at 1% and 
4% divergence; fig. 2). This contrasted strongly with the results 
using the true alignment, for which no incorrect topologies 
were inferred for any of the parameter settings. 

When recombination was included, the reliability of phy- 
logenetic reconstruction using only SNP positions decreased 
further. There were 140 parameter sets for which incorrect 
topologies were inferred, and the number of incorrectly in- 
ferred trees increased from 6,641 to 8,871 out of a total of 
27,500 data sets. 

Importantly, we also found that the choice of the reference 
taxon affected error rates (supplementary fig. SI, 
Supplementary Material online). For example, although tree 
shapes three and four have identical branch lengths and differ 
only in the position of the reference sequence, the accuracy of 
tree reconstruction differed considerably. When the reference 
taxon was on a short branch (0.5% divergence) and all other 
branches were long (8% divergence), no errors were made in 
inferring the topology. In contrast, when the sister taxon of 
the reference was on a short branch and all other branches 
were long, errors were made in 82% of all cases. 

Including Nonpolymorphic Sites Improves Reliability 
The above analyses were performed on alignments containing 
only SNP positions. When nonpolymorphic positions were 
included in the alignments (i.e., all nonpolymorphic positions 
that were successfully mapped to the reference genome), the 
accuracy of phylogenetic inference improved. Erroneous to- 
pologies were reconstructed for only a single parameter set, in 
tree shape eight: when the branch of the reference's sister 
taxon and the internal branch were short (0.5%) and all other 
branches were long (8%), the incorrect topology was inferred 
in 12% of all simulations (fig. 2). When recombination was 
included, the accuracy again decreased strongly for five pa- 
rameter combinations compared with the true alignments 
(supplementary fig. SI, Supplementary Material online). 

Thus, when aligning short reads to a single reference 
genome, there were still some parameter sets for which 
trees could not be reliably reconstructed. However, for the 
same parameter sets, no inaccuracies arose when trees were 
inferred without reference mapping (i.e., using the correct and 
complete alignment). This demonstrates that the inaccuracy 
in phylogenetic reconstruction was due to biases that arose in 
mapping short reads to a single reference sequence. 

It is likely that the inaccurately inferred tree topologies are 
caused primarily by a combination of two factors: 1) Short- 
read aligners such as Bowtie2 can only map sequences closely 
related to the reference, such that sequences with too many 
mismatches are discarded and 2) the relative distance to the 
reference is important, as regions that are on average more 
closely related to the reference are less likely to be discarded 
than regions that are more distant to the reference. Figure 3 
qualitatively illustrates how biases are introduced. Assuming 
that the alignment algorithm only allows a single mismatch 
between the query and reference sequence within a short 
region, only a single mutation in the branch leading to the 
reference would be allowed, and any additional mutations in 
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the region would cause the alignments to be discarded 
(fig. 3A). In contrast, three separate mutations would be al- 
lowed in the terminal branches that lead to the other leaves in 
the tree (fig. 3C). As a consequence, the fraction of columns 
having identical nucleotides in all taxa would be inflated, 
whereas the fraction of columns in which all nucleotides 
would be equal except for the nucleotide in the reference is 
underestimated (supplementary fig. S2, Supplementary 
Material online). As shown in supplementary figure S3, 
Supplementary Material online, such biases decrease the 
extent to which the likelihood function supports the correct 
phylogeny over incorrect alternative topologies, and this is 
most dramatic for the problematic tree shape 8 (supplemen- 
tary fig. S3B, Supplementary Material online). 

Branch Lengths Are Highly Inaccurate When Using SNP 
Positions Only 

To analyze how these biases affect branch lengths, we quan- 
tified branch lengths from all phylogenetic trees that were 
correctly reconstructed: 1) from the true alignments; 2) from 
alignments obtained after short-read mapping and SNP ex- 
traction (without nonpolymorphic sites); and 3) from the full 
alignments obtained after short-read mapping (including all 
nonpolymorphic sites). Because we obviously expected to 
infer longer overall branch lengths when including SNP posi- 
tions alone, we assessed the accuracy of the inferred relati\je 
branch lengths instead of total branch lengths. We defined 
the relative length of a branch as its length divided by the sum 
of all branch lengths within the tree. To quantify the effects of 
reference mapping and SNP extraction on tree reconstruc- 
tion, we determined, for each branch in the tree, the ratio of 
its relative length after mapping and SNP extraction, to the 
relative length of the branch inferred from the true and com- 
plete alignment. 

We found again that accuracy was low when using single 
reference-mapped alignments containing SNP positions 
alone. The inferred branch lengths in these phylogenies dif- 
fered considerably from the true branch lengths (fig. 4A), and 
we found that even at relatively low levels of divergence 
(5.9%), in at least 13% of all reconstructions, each of the 



five branches was estimated to be less than one-tenth of 
their true relative length. 

By including both nonpolymorphic and SNP positions, the 
tree reconstruction accuracy increased considerably (fig. 4B). 
When total divergence across the true tree was less than 10%, 
branch length estimation was generally accurate, and branch 
lengths were only rarely under- or overestimated by more 
than 10%. However, at higher divergence levels, accuracy 
decreased rapidly. This decrease in accuracy also exposed a 
consistent bias, in which the lengths of the interior branch 
and the branch leading to the reference were consistently 
underestimated, while the lengths of the other branches in 
the tree were consistently overestimated. This confirmed our 
qualitative considerations above regarding the systematic 
biases that mapping to a single reference introduces. 

Combining Alignnnents from Mappings to Multiple Reference 
Taxa Allows for Accurate and Unbiased Phylogeny 
Reconstruction 

Although the inclusion of nonpolymorphic sites in the align- 
ment considerably improved the accuracy of tree reconstruc- 
tion, there were parameter regimes where topologies could 
still not be reconstructed correctly. Furthermore, relative 
branch lengths were inferred to be up to two times longer/ 
shorter than they ought to be (fig. 4B). Earlier, we have argued 
that this bias is caused by the relative position of the reference 
to the other sequences in the phylogeny. This suggests that 
this bias may be overcome by using multiple references that 
are more evenly distributed across the tree. However, as de- 
tailed in Materials and Methods, care has to be taken that no 
other systematic biases are introduced when combining align- 
ments from mappings to multiple references. We therefore 
developed an iterative procedure for merging alignment col- 
umns from mappings to different references into a final 
nonredundant alignment, ensuring that each genomic posi- 
tion from each reference occurs in at most one column of the 
final alignment and that conflicts between the mappings 
using different references are resolved. 

To test whether this strategy allowed us to create more 
accurate phylogenies, we focused on the parameter setting 




Reference : 
Sister: 
Cousin 1: 
Cousin 2: 



CI 



C2 



ATATGCAGTTACAGTACACTA 
ATATGCAGTAACAGTACACTA 
ATATGCAGTAACAGTACATAG 
ATATGCAGTAACAGTACATAG 



B 





ATATGCAGTAACAGTACACTA 
ATATGCAGTTACAGTACACTA 
ATATGCAGTAACAGAACACTA 
ATATGCAGTAACAGAACACTA 



ATATGCAGTAACAGTACACTA 
ATATGCAGTTACAGTACACTA 
ATATGCAGTAAGAGTACACTA 
ATATGCAGTAACAGTACACTA 



Fig. 3. Mapping to a single reference introduces alignment biases. Assuming for illustrative purposes, that the alignment algorithm allows only one 
mismatch between query and reference within a 21 -bp region, each panel shows the maximal number of mutations allowed in order for successful 
mapping of all orthologous fragments to occur, as a function of the positions in the tree where mutations occur. (A) If a single mutation occurs on the 
reference branch, then the distance from the reference to all other sequences reaches one immediately, and no further mutations are allowed. (B) One 
mutation on the internal branch as well as one mutation on the sister branch are allowed before all three query sequences reach a distance of one to the 
reference. (C) Three independent mutations on each of the external branches are allowed before all query sequences reach a distance of one to the 
reference. 



1081 



Bertels et al. 



doi:iai093/nnolbev/nnsu088 



MBE 



u. 



10% 20% 30% 

Divergence 

Ref 



10% 20% 30% 



10% 20% 30% 



10% 20% 30% 



10% 20% 30% 



K 



H 



Inferred branch length 
True branch length 



B 



>10 
2<x<10 
1.1<x<2 
0.9<x<1.1 
0.5<x<0.9 
0.1<x<0.5 
<0.1 



I 
I 



10% 20% 

Divergence 



10% 20% 



Fig. 4. Deviation of relative branch lengths, as inferred from mapped sequence alignments, from the true relative branch lengths for (A) phylogenies 
inferred using SNP positions only and (B) phylogenies inferred using all positions. For each branch in our simulated four-taxon trees, the figure shows the 
proportion of trees in which the estimated relative branch length deviated from the true relative branch length to a certain degree (color).The trees 
were subdivided into six equally sized bins based on the overall divergence level (proportion of columns within the original multiple sequence alignment 
that contain SNPs) and the branch length ratios were calculated for each divergence class (position on thex-axis). The proportion of trees inferred from 
mapped sequence alignments that contain relative branch lengths that are more than ten times greater than those from the true tree are shown in dark 
blue. Relative branch lengths that are more than ten times shorter are shown in dark red. Relative branch lengths that are within 10% of the true branch 
length are shown in white (see legend). The figure shows one plot for each of the five branches within the tree (this branch is indicated in green in the 
four-taxon trees between A and B). The reference sequence is always the taxon on the bottom left of the tree. Trees were only included in the statistics if 
the mapped tree topology matched the true (known) tree topology. 



that caused incorrect topologies even when nonpolymorphic 
sites are included (tree shape 8 with 0.5% and 8% divergence) 
and built four separate alignments by using each of the taxa as 
a reference. After merging the alignments, the correct phy- 
logeny was reconstructed in 100% of the cases. Furthermore, 
whereas relative branch lengths differed by up to 2'fold from 
the true alignment when using a single reference and when 
reconstructing the phylogeny from the merged alignment, 
relative branch lengths differed by at most 18% (fig. 5). This 
demonstrated that mapping sequences to multiple reference 
taxa allows for much more accurate tree reconstruction, even 
for substantially divergent sequences. 

We have now implemented this new method as a web 
server, REALPHY, to make this resource widely available. 

Application to Bacterial Genome Sequences 
The simulations presented above show under what condi- 
tions mapping to a single reference and inferring phylogenetic 
trees from SNP positions can lead to errors even for simple 
four-taxon trees. Here we show that these errors do typically 
occur in realistic data sets and that by merging alignments 
from multiple references, REALPHY avoids such errors. We 
analyzed three published data sets with sequences from E. coli 
(Touchon et al. 2009), P. syringae (Baltrus et al. 2011), and 
S. meliloti (Epstein et al. 2012). The first two data sets dem- 
onstrate how biases from mapping to a single reference can 
affect the inferred phylogeny. In addition, they allow us to 
compare phylogenies constructed by REALPHY with those 
using classical alignment methods (Touchon et al. 2009; 
Baltrus et al. 2011) and demonstrate that, as a consequence 




Fig. 5. Accuracy of estimated relative branch lengths when inferring a 
phylogeny from a single reference alignment (gray bars) and from a 
merged alignment of all four references (white bars). The relative branch 
length (BL) of a particular branch is defined as the length of the branch 
divided by the sum of all BLs in the tree. The BL ratio is the ratio of the 
estimated BL and the BL of the true tree. The bars show the BL ratios for 
each of the five branches (indicated at the bottom) of the trees inferred 
in 88 independent trials (all correctly reconstructed topologies) of align- 
ments from tree shape eight with divergences of 0.5% and 8%. Note that 
the closer the bars are to one, the more similar the estimated tree is to 
the true tree. 

of the larger number of sites that are included in REALPHY 
analyses, we obtain more accurate phylogenies. The S. meliloti 
data set illustrates how the use of only SNP positions can lead 
to errors in the reconstructed tree. 

Data from E. coli 

Touchon et al. (2009) determined the phylogeny of 20 fully 
sequenced £ coli and Shigella strains as well as one Lfergusonii 
strain using classical methods. They used whole-genome data 
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Fig. 6. Comparison of REALPHY phylogenies to phylogenies inferred in previous publications. Both REALPHY trees (green) were built using PhyML, with 
the general time-reversible (GTR) model of nucleotide evolution and gamma distributed rate variation. The annotation on the branch points in black 
denotes the bootstrap support for the branch points from a total of 100 bootstrap experiments (only shown if <100) for REALPHY trees, Bayesian 
probabilities for the Baltrus tree (shown if <0.95) and bootstrap values out of 1,000 for the Touchon tree (shown if < 1,000). Annotations in gray show 
the number of REALPHY single-reference trees that support the particular branch points (only shown if <21 for £. coli and <3 for P. syringae). Boxed 
parts of the trees contain differences to the previously published corresponding tree. (A) £. coli phylogeny reconstructed by Touchon et al. (2009) (left) 
compared with a phylogeny reconstructed from all 21 merged reference alignments produced by REALPHY. The differences between the two trees are 
the placements of £. coli 536 and S88. (B) P. syringae phylogeny reconstructed by Baltrus et al. (2011) (left) compared with a phylogeny based on 
mappings to the three fully sequenced P. syringae strains: P. syringae B728a, P. syringae pi/, phaseolicola 1448a and P. syringae pi/, tomato DC3000. Right: 
The root of the tree was arbitrarily selected to facilitate comparison between the two topologies. When inferring trees from single reference genome 
alignments, two branch points are not supported by all three trees (annotated on the corresponding branches). These branch points concern the 
placement of Cit7 (P. syringae B728a as reference) and Pae (P. syringae pi/, phaseolicola 1448a as reference). 



and reciprocal best BlastP to identify 1,878 genes that were 
present in all genomes. These orthologs were aligned (cover- 
ing ~40% of the shortest genome's length), and a distance 
matrix based on this alignment was calculated and used to 
build a neighbor-joining tree. 

We applied REALPHY with default parameters to the same 
data set, performing 21 separate runs using each of the 21 
taxa as a single reference sequence. The topologies of the 
inferred trees were almost identical. We identified only one 
branch point at which not all 21 phylogenies agreed. This 
branch point concerned the subclade containing £ coli 
APEC 01, UTI89 and S88 (fig. 6). Here, in 9 out of the 21 
cases, instead of APEC 01 clustering with S88, APEC 01 clus- 
tered with UTI89. 

We also merged all reference alignments using our merging 
procedure (1,896,194 bp total length; 170,886 SNP positions, 
covering 43% of the shortest genome's length) and inferred a 
tree for the combined alignment. The tree inferred for this 
merged alignment was identical to the consensus tree ob- 
tained with 12 of the 21 different references. 

We found that REALPHY's tree differed at two branch 
points (both in clade B2) from the tree calculated by 
Touchon et al. (2009) (fig. 6A). The first branch point 



concerned the aforementioned UTI89, APEC 01, and S88 
clade. The fact that this branch point was only supported 
by 12 of the 21 reference alignments suggests the data is 
indeed not completely unambiguous, and this is substanti- 
ated by a bootstrap experiment with 100 repeats showing a 
support of 88% for this branch, whereas all other branches 
have 100% support. The second branch point that differed 
between ours and the Touchon tree concerned the place- 
ment of £ coli 536, which was placed at the root of the B2 
clade in our reconstruction but clustered with CFT073 and 
EDla in the tree reconstructed by Touchon et al. (2009). 
Notably, Touchon et al. also presented a second tree based 
on a MAUVE alignment. In this case, 536 was placed as out- 
group to the B2 subclade, as our consensus and merged-data 
tree did. Furthermore, Touchon et al. showed that the sup- 
port for this branch is only about 92% compared with 100% 
for the rest of the tree. 

The facts that REALPHY uses a larger number of sites (43% 
of the smallest £ coli genome compared with 40% for the 
Touchon et al. [2009] data) and that REALPHY's tree matches 
the consensus tree of all reference alignments suggest that 
REALPHY's tree may be more accurate. To investigate this 
further, we selected only the seven strains from the B2 clade 
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and reran REALPHY on this data set. Because of the much 
higher similarity of this subset of sequences, the reference 
alignments included a much larger number of sites (covering 
about 76% of the shortest B2 genome). We found that the 
tree inferred by REALPHY for the merged alignment was 
identical to the trees inferred for all seven reference align- 
ments (supplementary fig. S4, Supplementary Material on- 
line). Moreover, this tree supported all REALPHY's branches 
from the tree of all 21 taxa, strongly supporting that 
REALPHY's tree was more accurate than the tree constructed 
by Touchon et al. Interestingly, the tree built from the B2 
clade differs from both REALPHY's and the Touchon et al. 
tree in the placement of the CFT073 strain, demonstrating 
that phylogenetic trees can often be further refined by ana- 
lyzing sequences from subclades separately. 

In summary, the analysis of the £ coli data showed that the 
resulting tree can be biased by the reference strain and that 
usage of merged alignments from multiple references avoids 
this bias in this case. It also indicated that REALPHY performs 
at least as well as classical methods that are more complex 
and time consuming and can even outperform these meth- 
ods when it is using a larger number of sites. 

Data from P. syringae 

In our second analysis, we studied a published P. syringae data 
set (Baltrus et al. 2011), consisting of three fully sequenced 
genomes and 16 draft genomes in FASTA format. This se- 
quence set was considerably more divergent than the above £ 
coli data set (~9% compared with ~14%, respectively). As 
discussed earlier, we expect the effects of reference mapping 
bias to increase as sequence divergence increases. Indeed this 
bias becomes apparent when comparing the reference align- 
ment lengths from P. syringae to those of £ coli. Although for 
the £ coli data approximately 43% of the genomes was cov- 
ered by the REALPHY alignments, for the P. syringae genomes, 
this coverage ranges from 17.6% to 18.9%. As may be ex- 
pected, this alignment bias is significant enough to affect 
the inferred P. syringae topology. When we used P. syringae 
B728a as the reference sequence, it was placed as most basal 
taxon in the group II clade instead of Cit7; and when we used 
P. phaseolicola 1448a as reference, P. phaseolicola 1448a was 
placed as the most basal taxon in group III instead of Pae 
(fig. 6B). As both differences concern the clade in which the 
reference strain is present, it is probable that these differences 
are the result of a mapping bias to the reference sequence. 
This bias was removed when we constructed a merged align- 
ment obtained from all three reference genomes. This align- 
ment contained a total of 1,403,205 bp (236,228 SNPs, 
covering 23% of the smallest reference) and the inferred 
tree agreed completely with the consensus tree of the three 
individual reference phylogenies. 

Notably, there were some disagreements between the to- 
pology of our tree and the multilocus sequence typing 
(MLST) tree inferred by Baltrus et al. (2011). As the MLST 
tree is inferred from only a small number of sites, Baltrus et al. 
inferred another tree based on a concatenated alignment of 
324 proteins (corresponding to roughly 6% of the shortest P. 
syringae genome's length), and this phylogeny is more similar 



to the one inferred by REALPHY. In this case, Pma and 
Por_1_6 clustered together, as well as Pto_DC3000 and 
Pla106, agreeing with our inferred topology. As our phylogeny 
is based on an alignment that contains far more sites than 
both the protein alignment and the MLST alignment, this 
suggests that our phylogeny is likely to model the evolution- 
ary relationships between P. syringae strains more accurately 
than the phylogenies presented by Baltrus et al. 

This example further confirms that usage of a single refer- 
ence can significantly bias the resulting topology and that 
REALPHY's inferred phylogenies are often more accurate 
than phylogenies constructed from a smaller number of se- 
lected sites. 

Data from S. meliloti 

In the previous examples, the phylogenies were constructed 
from all alignment sites, that is, both SNP sites and nonpoly- 
morphic sites. To illustrate reconstruction errors that result 
from using only SNP sites, we applied REALPHY to a set of 
S. meliloti strains (Epstein et al. 2012). Because this data set 
consists of very closely related strains that differ only by a 
maximum of ~1%, we do not expect to observe significant 
reference alignment bias (trees inferred from the two refer- 
ences Rm41 and 1021 were identical). However, the usage of 
SNP sites only may affect the inferred phylogeny. To test this, 
we inferred a phylogeny using PhyML from a complete align- 
ment and an alignment containing only SNP sites (fig. 7). We 
found that there is one significant difference between the 
resulting tree topologies, affecting the placement of T094. 
In addition, the relative branch lengths of the tree inferred 
from SNP sites only changed significantly (supplementary fig. 
S5, Supplementary Material online). Interestingly, for the 
S. meliloti data set, relative branch lengths changed more 
severely than for the £ coli and P. syringae data sets, despite 
the fact that both the £ coli and the P. syringae data set are 
more diverged than the S. meliloti data set. These results fur- 
ther highlight the importance of including nonpolymorphic 
sites in alignments from which phylogenies are inferred using 
maximum likelihood methods. 

Conclusion 

In recent years, numerous studies (e.g, Harris et al. 2010; 
Croucher et al. 2011; Mutreja et al. 2011; Holt et al. 2012; 
McAdam et al. 2012) have reconstructed phylogenetic trees 
for large numbers of closely related bacterial strains by map- 
ping short-sequence reads to a reference genome sequence. 
Here, we have analyzed the performance of such methods on 
simulated and real sequence data and have shown that there 
are two primary pitfalls to this approach. The most readily 
apparent is that when SNP alignments are used to construct 
trees with maximum likelihood methods, it can lead to in- 
correct tree topologies and inaccuracy in the inferred branch 
lengths. Furthermore, when query sequences are sufficiently 
divergent from the reference sequence, the most divergent 
regions of the genome may fail to map, and this mapping bias 
may lead to incorrect branch lengths and topologies. Notably, 
the simulations that we have presented here did not include 
any variation in mutation rates across the genome; biases in 
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Fig. 7. Comparison between two phylogenies inferred from a REALPHY alignment of Sinorhizobium meliloti strains (Epstein et al. 2012) including (left) 
and excluding (right) nonpolymorphic alignment sites. The alignments were created by merging the reference alignments from S. meliloti Rm41 and 
1021. The red box highlights differing branch points. Bootstrap support is indicated if below 100%, except for the blue clade where the support is low. 



transitions or transversions; or clustering of mutations due to 
selection; each of these could serve to exacerbate the problem 
of biased sequence mapping. 

To address these pitfalls, we have presented a new method, 
REALPHY, which can successfully avoid biases from mapping 
to a single reference by implementing a procedure for merg- 
ing alignments obtained by mapping to multiple reference 
genomes into a single nonredundant alignment. 

REALPHY was mainly designed to reconstruct phylogenies 
for microbial genomes, that is, bacterial genomes and single- 
celled eukaryotes such as fungi, but it can in principle be 
equally applied to data from higher eukaryotic organisms. 
However, such applications have not been tested yet and, 
as described in Materials and Methods, the computational 
resources that are required increase with the size of the input 
genomic data and may become prohibitive for large eukary- 
otic genomes that contain many repetitive sequences. 

To make this method available to a large community of 
researchers, including pure biologists without bioinformatics 
expertise, we provide REALPHY through a web server, allow- 
ing the fast and automated generation of multiple sequence 
alignments from a variety of genome sequence data formats 
(e.g., Illumina sequence reads, contigs, draft genomes, fully 
sequenced genomes), and the automatic reconstruction of 
phylogenies from these alignments. 

Materials and Methods 

REALPHY Implementation 

A flowchart of the REALPHY implementation is presented in 
figure 8. 



Pipeline Requirements 

The REALPHY pipeline requires the user to provide a set of 
DNA sequences for each taxon to be included in the phylo- 
genetic tree. This set will typically consist of short-sequence 
reads but may also include larger sequences, such as fully or 
partially assembled genomes. In addition, REALPHY requires 
one or more reference sequence sets to which all sequences 
will be aligned. Each reference sequence set should consist of a 
whole-genome sequence, a set of chromosome sequences, or 
a set of contigs. 

Alignment 

Sequence reads from each query genome provided as FASTQ 
formatted files are directly mapped to each of the reference 
sequences using BowtieZ Assembled genomes provided in 
PASTA or GenBank format are divided into all possible sub- 
sequences of 50 bp (default) to be able to efficiently map 
these sequences to a reference genome with Bowtie2. 
REALPHY calls Bowtie2 with the default k-mer length of 22, 
allowing one mismatch within the k-mers to maximize sen- 
sitivity. For each short sequence, only the best mappings are 
retained, that is, when there are n >1 "best" mappings; each 
of the mappings is assigned a weight Mn. For each reference 
sequence, the short-read mappings for all query genomes are 
combined into a multiple alignment containing all ortholo- 
gous positions that can be reliably identified across the refer- 
ence and query genomes. 

It is possible that paralogous fragments from a query 
genome may map to the same position as an ortholog in a 
reference sequence. If these paralogous fragments have di- 
verged, reads from the same query genome may report 
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Fig. 8. Illustration of the individual steps in the REALPHY pipeline (running from top to bottom). All fully sequenced or assembled genomes (FASTA 
and GenBank files) are divided into all overlapping 50-bp subsequences. Short sequences are aligned to individual reference sequences with Bowtie2. 
Alignment columns are created from all pairwise mappings to the references. Individual reference alignments are merged into a single multiple sequence 
alignment. A phylogeny is reconstructed from merged and individual reference alignments via PhyML 



differing nucleotides aligned to the same position of the ref- 
erence genome. To avoid such inconsistent mappings, only 
unambiguous positions are included in the final alignment. 
Unambiguous position assignment results if the weighted 
sum of mappings from the query genome is >10, and 
>95% of the mappings show the same nucleotide. This per- 
centage was chosen to make it unlikely that paralogous map- 
pings would pass the cutoff but would reduce false negatives 
due to sequencing errors, which are relatively common in 
high-throughput sequencing data (Nakamura et al. 2011). 
By default, only those alignment columns are retained in 
which a nucleotide from each of the taxa is present. 

In some cases, a small number of genomes may be highly 
diverged from all reference sequences in some genomic re- 
gions, resulting in no successful alignments. In other cases, 
some genomic regions may be missing entirely. This may be 
due to their absence in the sequencing data set due to uneven 
sequence coverage or due to gene deletions. Even if only a few 
strains are affected by these problems, these regions will be 
missing from the final alignment, as by default, REALPHY only 
includes regions of the genome for which all strains are pre- 
sent in the alignment. Although such situations are by 



definition problematic and may lead to inaccurate phyloge- 
nies, the user can choose to override the default parameters 
and include columns in the alignment in which either all or a 
specified proportion of genomes can have ambiguous or 
missing nucleotides. These missing nucleotides will be repre- 
sented by gaps. Importantly, it has been shown that under 
certain circumstances, phylogenetic trees reconstructed from 
such alignments can be more reliable than trees recon- 
structed from alignments in which gapped positions are omit- 
ted (Shavit Grievink et al. 2013). 

Combining Alignments from Mapping to Different Reference 
Sequences 

The results of the short-read mappings consist of a collection 
of alignment columns where mappings for all taxa exist. The 
easiest procedure for combining alignment columns that 
result from mapping to different references would be to col- 
lect the union of all alignment columns and apply a phylo- 
genetic reconstruction method to this data set. However, 
such a data set would be highly redundant, with a given 
position from a given reference occurring multiple times, 
that is, once for each reference to which it was mapped. 
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More importantly, certain positions may be represented 
more frequendy than others in a full collecdon of alignment 
columns, which is likely to introduce biases in phylogeny 
reconstruction. For example, it is well known that subsdtu- 
tion rates vary over several orders of magnitude for different 
genes within a genome (as reviewed in Rocha 2006). As a 
consequence, positions from slowly evolving genes may be 
reliably mapped to distal reference genomes, whereas posi- 
tions from fast evolving genes can only be mapped to the 
closest reference genomes. Consequently, positions from 
slowly evolving genes are likely to be overrepresented in the 
full collecdon of alignment columns. 

To avoid such biases, REALPHY combines alignment col- 
umns from different references into a final set of alignment 
columns using the following procedure (supplementary fig. 
S6, Supplementary Material online). Alignment columns from 
all alignments are pooled and then iteratively processed as 
follows: 1) Randomly select an alignment (column C) from 
the pool. This column will contain both nucleoddes for 
aligned nonreference genomes (e.g., short-sequence read 
data) as well as nucleotides derived from positions Xr in 
each of the other reference genomes r. 2) For each of these 
positions Xr occurring in column C, we also select the align- 
ment column Q of nucleoddes mapped to posidon Xr in the 
reference r (if this column Q is present in the pool). 3) All 
selected columns, that is, C and the Q for all other references, 
are then removed from the pool, and a consensus column is 
calculated by applying a simple majority rule. 4) This consen- 
sus column is then added to the collecdon of final alignment 
columns. We condnue to select random columns from the 
pool undl there are no columns left. This ensures that each 
reference genome position occurs in only one of the final 
alignment columns and that possible disagreements about 
which nucleodde from a given taxon should be aligned to a 
given reference position are resolved through a simple major- 
ity rule. 

Tree Building 

Based on the final set of DNA sequence alignment columns, 
the pipeline determines a phylogenedc tree by applying 
PhyML (Guindon et al. 2010, default parameters) or 
Dnapars (a maximum parsimony method; Felsenstein 
2005). We chose PhyML as it is optimized for speed in 
terms of handling large numbers of taxa as well as long 
sequence alignments. The maximum likelihood method 
PhyML is run with the general time-reversible (GTR) model 
of nucleodde evoludon and gamma distributed rate variadon 
by default. Dnapars from the Phylip program suite is run with 
its default setdngs. 

Output 

For each reference genome the output consists of a FASTA 
and a PHYLIP formatted file that contain an aligned set of 
orthologous sites (SNPs as well as nonpolymorphic sites), a 
tree file in Newick format, and muldple tab-delimited files 
(one for each query genome) containing the posidons on the 
reference genome to which the identified SNPs were aligned. 



Computational Resources 

The resources REALPHY requires depend mainly on the 
genome length, the number of genomes, and the number 
of references. The disk space required (~60MB per 
Mbase x number of genomes x number of references) and 
the compudng dme (~2min per Mbase x number of ge- 
nomes X number of references) are linearly dependent on 
these three factors. Furthermore, the amount of RAM re- 
quired depends primarily on the sequence length and the 
number of genomes (~250MB per Mbase x number of ge- 
nomes). The compudng dme required for mapping (which is 
performed by Bowtie2) will be affected by the repetitiveness 
of the genomes. As we have not yet tested REALPHY on data 
from large eukaryodc genomes with many repetitive regions, 
we currently cannot meaningfully estimate how computa- 
tional dmes will scale for such large genomes. 

Implementation 

The pipeline has been fully automated and is provided as a 
web server at http://realphy.unibas.ch (last accessed March 
18, 2014). In addidon, a stand-alone implementadon in Java 
can be downloaded from the same website. 

Sequence Simulation 

We simulated sequence evolution in a custom-made Java 
program along four-taxon trees in which branch lengths 
were systematically varied between 0.5% and 8% divergence 
(fig. 1). These sequences were 100,000-bp long, with a GC 
content of 50%. Evolution occurred with identical transition 
and transversion rates, that is, using the elementary jukes- 
Cantor model (jukes and Cantor 1969). For each parameter 
combination (i.e., the combinadon of branch lengths in the 
tree), we repeated the simulation 100 times. 

Phylogenetic Analyses 

Muldple sequence alignments were built as descnbed for the 
REALPHY algonthm. From these alignments as well as the 
true alignments, phylogenies were reconstructed using the 
maximum likelihood method PhyML with a GTR subsdtudon 
matrix and a gamma-distnbuted rate heterogeneity model 
(Guindon et al. 2010). 
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